Dry eye disease (DED) is a widespread, debilitating ocular condition that costs the NHS millions each year (Chiang and Wolffsohn, 2025) (Hossain et al., 2021). This study looks for trends in data from Public Health Scotland (PHS) to identify risk factors for DED. The research question is: Do seasons, rurality, sex, and age impact prescriptions of DED in Scotland? The results indicate that sex and age are possible risk factors, while seasonal variation did not show an impact, and rurality results were inconclusive.
The study uses the following data from PHS:
In the Data by Prescriber Location data set, the data selected was classified as ‘Tear deficiency, eye lubricant and astringent’ using the BNF item code (which can be determined in this table: https://opendata.nhsbsa.net/dataset/bnf-code-information-current-year).
The study uses paid_quantity to measure volume of prescriptions. According to PHS, this is defined as: “the number of doses or uses within a single item, rather than the quantity of the item itself.”
# Load required packages
library(tidyverse)
library(janitor)
library(gt)
library(scales)
library(plotly)
# Here the data is loaded programatically from local files instead of manually which keeps the coding more DRY and makes it easier for datasets to be added or removed without major disruption.
prescription_data <- list.files("data/prescription_data", full.names = TRUE) %>%
set_names() %>%
map(~ read_csv(.x) %>%
rename(any_of(c(HBT = "HBT2014"))) %>% # A minority of the Data by Prescriber Location's column for their healthboard is named 'HBT2014', and they need to be renamed to 'HBT' in order to successfully combined into a single dataset.
select(HBT, BNFItemCode, BNFItemDescription, GPPractice, NumberOfPaidItems, PaidQuantity, GrossIngredientCost, PaidDateMonth))
# A function is not used for data by prescriber location since the data is only bound after filtering for DED, as the data is otherwise too large and results in memory errors.
ded_prescriptions <- prescription_data %>%
map(~ .x %>% filter(grepl("110801", BNFItemCode))) %>% # This allows to filter only for the BNFItemCodes that begin with 110801 - the DED prescriptions.
bind_rows() %>%
clean_names()
read_folder <- function(path, select_cols = NULL) { # This function allows the rest of the datasets to be loaded and combined, and makes the code more DRY.
list.files(path, full.names = TRUE) %>%
set_names() %>%
map(read_csv) %>%
bind_rows() %>%
clean_names()}
gp_details <- read_folder(path = "data/gp_details", select_cols = c("PracticeCode", "DataZone", "PracticeListSize"))
gp_populations <- read_folder(path = "data/gp_populations")
urban_rural_classification <- read_folder(path = "data/urban_rural_classification", select_cols = c("DataZone", "UrbanRural8fold2020"))
Studies have shown that DED is more common in patients who spend time outdoors, due to the impact of weather, pollution and allergies (Vidal-Rohra et al., 2023). This suggests that prescriptions may vary in relation to seasons. To measure seasonal variation, the volume of prescriptions per month was calculated and assigned to a season. This was faceted over 3 years. Trend lines were drawn to allow the overall trends to be observed.
# This calculates the total paid quantity of DED prescriptions for each month, then assign each month to its respective season.
ded_seasonal_summary <- ded_prescriptions %>%
mutate(month = paid_date_month %% 100, # Since the dates comes in the format YYYYMM, you can divide by 100 and take the remainder to find the month.
year = paid_date_month %/% 100, # This gives the year by dividing by 100 and dropping the remainder. The remaining integers is the year.
month_name = month.name[month]) %>% # Assigns the actual name of the month based on the month's corresponding number from 1-12.
group_by(month_name, month, year) %>%
summarise(total_paid_quantity = sum(paid_quantity)) %>%
mutate(month_name = factor(month_name, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")), # Ensures the x-axis of months appears in the correct order.
season = case_when(
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Autumn",
month %in% c(12, 1, 2) ~ "Winter"),
season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))) # Here, the months are assigned to their corresponding season, then set to appear in the correct order at the side of the graph.
my_breaks <- function(x) pretty(x, n = 3) # This function is used in the ggplot to set the total number of breaks of each graph (the amount of labels on the y-axis) to 3, so that the plot is less messy.
# Plots the total paid quantity by month, groups to 1 to ensure 1 continuous line, colours by season, draws a curved line, and facets by year.
ggplot(ded_seasonal_summary, aes(x = month_name, y = total_paid_quantity, group = 1, color = season)) +
geom_line(size = 1.5) +
geom_point(size = 2) +
scale_color_manual(values = c("Spring" = "green3", "Summer" = "orange", "Autumn" = "brown1", "Winter" = "blue")) + # Assigns specific colours for each season.
labs(
x = "Month",
y = "Total Paid Quantity",
title = "Monthly Paid Quantity of DED Prescriptions (2022-2024)",
subtitle = "Coloured by Season",
color = "Season") +
theme_minimal(base_size = 14.5) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Tilts the x-axis labels by 45 degrees to be more readable.
plot.title = element_text(size = 14.5, face = "bold"),
plot.subtitle = element_text(size = 12)) +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed", color = "black") + # Method = "loess" smooths data using local regression, which essentially means it creates a curved line around the data points.
facet_grid(year ~ ., scales = "free_y") + # Facets the graph by each year. Scales = "free_y" means that the y-axis for each graph do not have to follow the same scale. This way, the scale for each graph will better fit their respective data.
theme(panel.spacing = unit(1.5, "lines")) + # Creates spacing between each of the faceted plots.
scale_y_continuous(breaks = my_breaks) # Function that sets total number of labels on y-axis per graph to 3.
The trend lines over 3 years do not show any discernible pattern. Prescriptions were broadly constant over 2022. They rose slightly over spring and summer in 2023, and fell throughout the year in 2024. The results do not indicate that prescriptions vary by season.
There are some limitations. Actual weather conditions are not used, and weather may vary within seasons and from year to year. There may be regional variations in climate and differences between urban and rural locations. For the next steps, further studies could compare prescription volumes with actual temperature and weather patterns.
Air pollution and outdoor activity have been correlated with DED (Vidal-Rohra et al., 2023). In addition, it is postulated that proximity to urban centres may improve access to prescriptions. As these factors may correspond to rural and urban settings, this study looks at the impact of urban and rural location on prescriptions.
To measure this, the volumes of prescriptions from all GP practices was sorted by the categories in the Urban Rural Classification 2020. Then each category was divided by the respective total number of registered patients. This was calculated for 3 separate years.
# The GP Practice Contact Details and List Sizes dataset provides the data zone for each practice. The Urban Rural Classification 2020 assigns rurality by data zones. Therefore, joining by data_zone allows the rurality of each practice to then be determined.
gp_rurality_classified <- inner_join(gp_details, urban_rural_classification, by = "data_zone") %>%
select(practice_code, urban_rural8fold2020)
# Yearly prescriptions of DED are calculated.
ded_yearly_summary <- ded_prescriptions %>%
mutate(year = paid_date_month %/% 100) %>%
group_by(year, gp_practice) %>%
summarise(total_paid_quantity = sum(paid_quantity, na.rm = TRUE))
# Practice sizes for each year are calculated for each GP practice.
gp_yearly_populations <- gp_populations %>%
mutate(year = (date %/% 10000)-1) %>%
filter(sex == "All") %>%
group_by(year, practice_code) %>%
summarise(total_all_ages = sum(all_ages, na.rm = TRUE))
# These datasets are joined to later show the number of DED prescriptions each year by GP practice.
ded_by_population_yearly <- full_join(ded_yearly_summary, gp_yearly_populations, by = c("gp_practice" = "practice_code", "year" = "year"), relationship =
"many-to-many")
# Calculates the total paid quantity, the total practice population, and paid quantity per person for each of the 8 categories in the Urban Rural Classification 2020 8 fold.
ded_rurality_summary <- full_join(ded_by_population_yearly, gp_rurality_classified, by = c("gp_practice" = "practice_code"), relationship =
"many-to-many") %>%
filter(!is.na(urban_rural8fold2020)) %>% # Any missing data is cleared.
group_by(year, urban_rural8fold2020) %>%
summarise(total_paid_quantity = sum(total_paid_quantity, na.rm = TRUE),
total_all_ages = sum(total_all_ages, na.rm = TRUE)) %>%
mutate(itemsperperson = total_paid_quantity / total_all_ages, urban_rural8fold2020 = factor(urban_rural8fold2020, levels = 1:8, labels = c("Large Urban Areas [1]", "Other Urban Areas [2]", "Accessible Small Towns [3]", "Remote Small Towns [4]", "Very Remote Small Towns [5]", "Accessible Rural Areas [6]", "Remote Rural Areas [7]", "Very Remote Rural Areas [8]"))) %>% # Each of the classifications are given their corresponding name.
select(year, urban_rural8fold2020, itemsperperson) %>%
pivot_wider(names_from = year, values_from = itemsperperson) # Converts the data from long to wide format so that each year becomes its own column with the corresponding items per person.
# Creates a table with a heatmap effect that includes paid quantity per person by Urban Rural Classification from 2022-2024.
ded_rurality_summary %>%
gt(rowname_col = "urban_rural8fold2020") %>% # Rows are set to each of the 8 classifications.
tab_header(title = md("**DED Prescribing by Urban Rural Classification (2024)**")) %>%
fmt_number(
columns = c(`2022`, `2023`, `2024`), # This sets the columns as 2022, 2023 and 2024.
decimals = 3) %>%
tab_spanner(
label = "Paid Quantity per Person",
columns = c(`2022`, `2023`, `2024`)) %>% # Above the columns 2022, 2023 and 2024, the spanner 'Paid Quantity per Person' is added.
tab_style( # This is where the stylisation start.
style = cell_text(align = "center", weight = "bold"), # Spanner text is made bold and centred.
locations = cells_column_spanners(spanners = "Paid Quantity per Person")
) %>%
tab_style(
style = cell_text(align = "center"), # Contents are centred.
locations = cells_body(columns = everything())
) %>%
tab_style(
style = cell_text(align = "center", weight = "bold"),
locations = cells_column_labels(columns = everything()) # This centres all the column labels and makes them bold.
) %>%
tab_style(
style = list(cell_text(weight = "bold")), locations = cells_title(groups = "title")) %>% # Makes title bold.
tab_style(
style = cell_borders(sides = "bottom", color = "black", weight = px(2)), locations = cells_column_labels()) %>% # This creates an underline border at the column labels to create a visible division with the table contents.
tab_options(
table.width = pct(100), # Scales the width of the table to fit the whole screen, otherwise the table stays too thin.
table.font.size = 14.5,
heading.title.font.size = 16,
heading.subtitle.font.size = 14.5,
column_labels.background.color = "lightgray") %>% # Makes the column label background darker, which in combination with the black underline border, helps the table have better contrast.
data_color(
columns = c(`2022`, `2023`, `2024`),
colors = scales::col_numeric(
palette = c("lightyellow", "orange", "brown1"), # This creates a heatmap effect on the table using a colour scale based on paid quantity per person.
domain = NULL))
| DED Prescribing by Urban Rural Classification (2024) | |||
|
Paid Quantity per Person
|
|||
|---|---|---|---|
| 2022 | 2023 | 2024 | |
| Large Urban Areas [1] | 0.939 | 0.841 | 0.767 |
| Other Urban Areas [2] | 1.074 | 0.977 | 0.937 |
| Accessible Small Towns [3] | 1.108 | 0.984 | 0.888 |
| Remote Small Towns [4] | 1.606 | 1.431 | 1.286 |
| Very Remote Small Towns [5] | 1.330 | 1.158 | 1.024 |
| Accessible Rural Areas [6] | 1.236 | 1.057 | 0.958 |
| Remote Rural Areas [7] | 1.163 | 0.930 | 0.877 |
| Very Remote Rural Areas [8] | 1.355 | 1.002 | 0.787 |
In all 3 years the lowest average dosage per person is in Large Urban Areas and the highest is in Remote Small Towns, while the second highest is Very Remote Small Towns. This would indicate that prescription levels are higher in Small Towns, however Accessible Small Towns do not fit this trend. The other results show inconsistent patterns. All the prescriptions decline slightly over the 3-year period. Remote Rural and Very Remote Rural Areas decline noticeably over the 3-year period.
Accessible Small Towns and Urban Areas would be expected to have easier access to medical care and higher pollution, however they show lower prescription levels. Rural areas might be expected to have lower prescriptions and cleaner air, but the prescription levels are inconsistent. However, the pattern is not entirely random as the highest and lowest areas are consistent over three years. This suggests that there may be other factors influencing the results.
This study is limited to only 3 years. A more extensive study might reveal more distinct trends. A finer breakdown of categories may also reveal more significant trends. The next steps would be to extend the timescale of the data, for 5 or 10 years, and to use more categories for the urban rural classification.
Research has shown that females are more likely to suffer from DED than males (Hossain et al., 2021) (Vidal-Rohra et al., 2023). To explore this trend, the study compared the ratio of female to male patients in each GP practice with the average volume of prescriptions per person for the same practice in 2024. A log scale was used to prevent extremes that might stretch the plot, which makes the trends easier to see.
# DED prescriptions are calculated by GP practice in 2024.
ded_by_gp_2024 <- ded_prescriptions %>%
filter(str_starts(as.character(paid_date_month), "2024")) %>%
group_by(gp_practice) %>%
summarise(total_paid_quantity = sum(paid_quantity, na.rm = TRUE))
# A new dataset is created which contains only GP population data from 2024.
gp_populations_2024 <- gp_populations %>%
filter(str_starts(as.character(date), "2024"))
# Calculates the female to male ratio of each practice by dividing the total female population by the total male population.
gp_sex_ratio_2024 <- gp_populations_2024 %>%
group_by(sex, practice_code, date) %>%
summarise(total_all_ages = sum(all_ages, na.rm = TRUE)) %>%
pivot_wider(names_from = sex, values_from = total_all_ages) %>% # Transforms data from a long to a wide dataset, making Female and Male into columns with the corresponding number of patients.
mutate(female_to_male_ratio = Female / Male) %>%
filter(!is.na(female_to_male_ratio))
# Joins the total paid quantity of each practice in 2024 dataset with the sex ratio by gp dataset. Then it calculates the average paid quantity per person for each practice.
ded_by_sex_ratio_2024 <- inner_join(ded_by_gp_2024, gp_sex_ratio_2024, by = c("gp_practice" = "practice_code")) %>%
mutate(rate_per_person = total_paid_quantity / All)
# Plots the ratio of female to male patients in each GP practice with the paid quantity per person of each practice.
(ggplot(ded_by_sex_ratio_2024, aes(x = female_to_male_ratio, y = rate_per_person, size = All, color = All)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", linetype = "dashed", color = "black", inherit.aes = FALSE, # Standard error is left on.
aes(x = female_to_male_ratio, y = rate_per_person)) + # The aesthetic from ggplot are not inherited and instead regression is only based on these variables. If this is not done, it will try and interpret size and colour as part of the regression, resulting in errors.
scale_x_log10(labels = scales::comma_format(accuracy = 0.01)) + # This creates a logarithmic scale, which prevents extreme ratios from stretching the plot and makes trends easier to see.
scale_y_continuous(labels = scales::comma_format(accuracy = 0.01)) + # Keeps the y-axis to 2 decimal points for consistency.
scale_size_continuous(range = c(2, 8), guide = "none") + # Scales points proportionally to practice population size. Smallest size is 2 and largest is 8. The legend is also removed so that only the colour scale appears.
scale_color_gradient(low = "yellow", high = "brown1", name = "Practice Population") + # Creates a colour gradient based on practice population size and adds a colour scale.
labs(x = "Female-to-Male Ratio (log scale)",
y = "Paid Quantity per Person",
title = "Effect of Female-to-Male Ratio on DED Prescription Rate (2024)"
) +
theme_minimal(base_size = 14.5) +
theme(
plot.title = element_text(size = 14.5, face = "bold"),
plot.subtitle = element_text(size = 12))) %>%
ggplotly() # Makes the graph interactive.
The results show a positive relationship. As the ratio of female to male increases in GP practices, there is an increase in the prescription of DED items across GP practices, consistent with previous research.
One limitation is this only looks at the total number of prescriptions in proportion to the number of females and males in each practice and does not show actual prescriptions by sex, which does not guarantee that it is the females who are being prescribed the DED. The next step would be to get data on the actual prescriptions by sex.
Studies show that prevalence of DED increases with age, with the majority of sufferers aged over 50 (Hossain et al., 2021). To explore this trend, this study calculated weighted average ages for each GP practice. This was done because the data is provided in age groups and using a weighted average allows it to be plotted in a continuous line. These weighted averages were then compared with the total volume of prescriptions per person of each practice.
# Calculates weighted average ages for each GP practice.
gp_population_normalised <- gp_populations_2024 %>%
rowwise() %>% # Performs each row independently
mutate(weighted_age = (ages0to4*2.5 + ages5to14*10 + ages15to24*20 + ages25to44*34.5 + ages45to64*54.5 + ages65to74*69.5 + ages75to84*79.5 + ages85plus*90) / all_ages) # Weighted age is calculated by taking the middle number in each age group and multiplying that by the number of people in each age group, then dividing by the total ages.
# Paid quantity by person is calculated for each GP practice by joining with the 2024 DED prescriptions by practice dataset.
ded_by_normalised_population <- inner_join(ded_by_gp_2024, gp_population_normalised, by = c("gp_practice" = "practice_code")) %>%
filter(sex == "All") %>%
mutate(rate_per_person = total_paid_quantity / all_ages) %>%
filter(!is.na(weighted_age))
# Weighted averages are plotted against the paid quantity per person of each practice.
(ggplot(ded_by_normalised_population, aes(x = weighted_age, y = rate_per_person, size = all_ages, color = all_ages)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", linetype = "dashed", color = "black", inherit.aes = FALSE, # Standard error is left on.
aes(x = weighted_age, y = rate_per_person)) + # Same reasoning as the previous graph, we do not want the linear model to inherit the size and colour aesthetics, otherwise there will be errors.
scale_x_continuous(labels = scales::comma_format(accuracy = 0.01)) +
scale_y_continuous(labels = scales::comma_format(accuracy = 0.01)) + # Keeps X and Y-axis consistent to 2 decimal points.
scale_size_continuous(range = c(2, 8), guide = "none") + # A scaling size by practice population is also given for this graph, and the legend removed.
scale_color_gradient(low = "yellow", high = "brown1", name = "Practice Population") + # Again, a colour gradient based on practice sizes and a colour scale is added.
labs(x = "Weighted Mean Age of Practice Population",
y = "Paid Quantity per Person",
title = "Weighted Mean Age on DED Prescription Rate (2024)")+
theme_minimal(base_size = 14.5) +
theme(
plot.title = element_text(size = 14.5, face = "bold"),
plot.subtitle = element_text(size = 12))) %>%
ggplotly() # Graph is made to be interactive.
The results indicated a positive relationship between weighted mean age and prescription of DED items per person. This supports the existing research.
One limitation is that this plot uses the total number of prescriptions of each GP practice in proportion to the weighted mean age. It is not actual prescription per patient data, but a prediction. The next steps would be to gather the prescription data from a sample of actual patients and look for trends.
This study has considered whether seasonal variation, rurality, sex and age affect DED prescriptions in Scotland. The results suggest possible positive correlations with sex and age, which is consistent with previous research. Seasonal variation did not appear to be a risk factor, and the results of the rurality study indicated some patterns, but not as expected. The study revealed that total prescriptions have declined over the 3-year period of the study. The reasons for this decline need to be investigated. Further research with more detailed data and longer timescales may give more conclusive results.